1. A method of determining whether a test sample, having test data T, is
categorized in one of a number n of classes wherein n is 2 or more, comprising:

extracting a plurality of emerging patterns from a training data set D that has at least
one instance of each of said n classes of data;
creating n lists, wherein:
an ith list of said n lists contains a frequency of occurrence, f_i(m), of each
emerging pattern EP_i(m) from said plurality of emerging patterns that has a non-zero
occurrence in an ith class of data;
using a fixed number, k, of emerging patterns, wherein k is substantially less than a total
number of emerging patterns in the plurality of emerging patterns, calculating n scores;
wherein:
an ith score of said n scores is derived from the frequencies of k emerging
patterns in said ith list that also occur in said test data; and
deducing which of said n classes of data the test data is categorized in, by selecting the
highest of said n scores.

2. The method of claim 1, additionally comprising: 

if there is more than one class with the highest score, deducing which of said n classes
of data the test data is categorized in by selecting the largest of the classes of data having
the highest score.
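
By way of non-limiting illustration, the selection step of claims 1 and 2 may be sketched as follows. The function name `pick_class` and its two dictionary arguments are hypothetical, and the "largest" class is assumed here to mean the class with the most training instances.

```python
def pick_class(scores, class_sizes):
    """Return the class label with the highest score.

    scores: dict mapping class label -> score (one score per class).
    class_sizes: dict mapping class label -> number of training
        instances (assumed meaning of "largest of the classes").
    """
    best = max(scores.values())
    # Collect every class that attains the highest score.
    tied = [label for label, s in scores.items() if s == best]
    # With a single winner this returns it directly; otherwise the
    # tied class with the most training instances is selected.
    return max(tied, key=lambda label: class_sizes[label])
```

For example, with scores of 2.0 for both class "a" and class "b" and class sizes of 5 and 9, the sketch selects "b".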

3. The method of claim 1 or 2, wherein: 

said k emerging patterns of the ith list that occur in said test data have the highest
frequencies of occurrence in said ith list amongst all those emerging patterns of said ith list
that occur in said test data, for all i.

4. The method of any one of the preceding claims, wherein: 

emerging patterns in the ith list are ordered in descending order of said frequency of
occurrence in said ith class of data, for all i.

5. The method of any one of the preceding claims, wherein the ith list has a length l_i and k
is a fixed percentage of the smallest l_i.



6. The method of any one of claims 1 to 4, wherein the ith list has a length l_i and k is a
fixed percentage of $\sum_{i=1}^{n} l_i$.

7. The method of any one of claims 1 to 4, wherein the ith list has a length l_i and k is a
fixed percentage of any l_i.

8. The method of any one of claims 5 to 7, wherein said fixed percentage is from about
1% to about 5% and k is rounded to a nearest integer value.
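
By way of non-limiting illustration, the choice of k as a fixed percentage of a list length may be sketched as follows. The function name `choose_k` and the `basis` argument are hypothetical, and halves are assumed to round upwards since the claims say only "a nearest integer value".

```python
def choose_k(list_lengths, percentage, basis="smallest"):
    """Derive k from the list lengths l_1 .. l_n.

    list_lengths: lengths of the n per-class lists.
    percentage: the fixed percentage, e.g. 5.0 for 5%.
    basis: "smallest" (shortest list), "sum" (sum of all lengths),
        or an integer index selecting one particular list.
    """
    if basis == "smallest":
        base = min(list_lengths)
    elif basis == "sum":
        base = sum(list_lengths)
    elif isinstance(basis, int):
        base = list_lengths[basis]
    else:
        raise ValueError("basis must be 'smallest', 'sum' or a list index")
    # Round half up (assumption); Python's built-in round() would
    # instead round exact halves to the nearest even integer.
    return int(base * percentage / 100.0 + 0.5)
```

With list lengths of 400 and 1000 and a percentage of 5, the "smallest" basis gives k = 20.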

9. The method of any one of the preceding claims, wherein n = 2. 

10. The method of any one of claims 1 to 8, wherein n = 3 or more.

11. A method of determining whether a test sample, having test data T, is categorized in a 
first class or a second class, comprising: 

extracting a plurality of emerging patterns from a training data set D that has at least
one instance of a first class of data and at least one instance of a second class of data;
creating a first list and a second list wherein:
said first list contains a frequency of occurrence, f_1(m), of each emerging
pattern EP_1(m) from said plurality of emerging patterns that has a non-zero
occurrence in said first class of data; and
said second list contains a frequency of occurrence, f_2(m), of each emerging
pattern EP_2(m) from said plurality of emerging patterns that has a non-zero
occurrence in said second class of data;
using a fixed number, k, of emerging patterns, wherein k is substantially less than a total
number of emerging patterns in the plurality of emerging patterns, calculating:
a first score derived from the frequencies of k emerging patterns in said first list
that also occur in said test data, and
a second score derived from the frequencies of k emerging patterns in said
second list that also occur in said test data; and
deducing whether the test data is categorized in the first class of data or in the second
class of data by selecting the higher of said first score and said second score.

12. The method of claim 11, additionally comprising:

if said first score and said second score are equal, deducing whether the test sample is 
categorized in the first class of data or in the second class of data by selecting the larger of 
the first or the second class of data. 

13. The method of claim 11 or 12, wherein:

said k emerging patterns of said first list that occur in said test data have the highest 
frequencies of occurrence in said first list amongst all those emerging patterns of said first 
list that occur in said test data; and 

said k emerging patterns of said second list that occur in said test data have the highest 
frequencies of occurrence in said second list amongst all those emerging patterns of said 
second list that occur in said test data. 

14. The method of any one of claims 11 to 13, wherein:

emerging patterns in said first list are ordered in descending order of said frequency of 
occurrence in said first class of data, and 

emerging patterns in said second list are ordered in descending order of said frequency 
of occurrence in said second class of data. 

15. The method of any one of claims 11 to 14, additionally comprising:
creating a third list and a fourth list, wherein:
said third list contains a frequency of occurrence, f_1(i_m), in said first class of
data of each emerging pattern i_m from said plurality of emerging patterns that has a
non-zero occurrence in said first class of data and which also occurs in said test
data; and
said fourth list contains a frequency of occurrence, f_2(j_m), in said second class
of data of each emerging pattern j_m from said plurality of emerging patterns that has
a non-zero occurrence in said second class of data and which also occurs in said test
data; and wherein
emerging patterns in said third list are ordered in descending order of said
frequency of occurrence in said first class of data, and
emerging patterns in said fourth list are ordered in descending order of said
frequency of occurrence in said second class of data.
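
By way of non-limiting illustration, the construction of such a third or fourth list may be sketched as follows. The function name `matched_list` is hypothetical, and each emerging pattern is assumed to be represented as a frozenset of items, with a pattern occurring in the test data when all of its items are present.

```python
def matched_list(class_list, test_items):
    """Build the list of a class's emerging patterns that occur in T.

    class_list: list of (pattern, frequency) pairs for one class,
        where pattern is a frozenset of items (assumed representation)
        and frequency is its non-zero frequency in that class.
    test_items: set of items present in the test data T.
    Returns the matching pairs in descending order of frequency.
    """
    # A pattern occurs in T when it is a subset of T's items.
    hits = [(p, f) for p, f in class_list if p <= test_items]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Applying the same function to the first list gives the third list, and applying it to the second list gives the fourth list.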

16. The method of claim 15, wherein:

said first score is given by: $\sum_{m=1}^{k} \frac{f_1(i_m)}{f_1(m)}$, where $EP_1(i_m) \in T$; and

said second score is given by: $\sum_{m=1}^{k} \frac{f_2(j_m)}{f_2(m)}$, where $EP_2(j_m) \in T$.
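
By way of non-limiting illustration, a score of this kind may be sketched as follows, assuming that it is the sum, for m from 1 to k, of the frequency of the mth highest-frequency emerging pattern occurring in T divided by the frequency of the mth highest-frequency emerging pattern of the class as a whole. The function name `collective_score` is hypothetical, as is the handling of the case where fewer than k patterns occur in T.

```python
def collective_score(class_freqs, matched_freqs, k):
    """Score one class against the test data T.

    class_freqs: frequencies f(1), f(2), ... of all emerging patterns
        of the class, in descending order.
    matched_freqs: frequencies f(i_1), f(i_2), ... of those patterns
        of the class that occur in T, in descending order.
    k: number of top patterns to use.
    """
    # Assumption: if fewer than k patterns occur in T, only the
    # available terms are summed.
    top = min(k, len(matched_freqs))
    return sum(matched_freqs[m] / class_freqs[m] for m in range(top))
```

Under this reading each term is at most 1, so a score of k indicates that the k most frequent patterns of the class all occur in the test data.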



17. The method of any one of claims 11 to 16, wherein said first list has a length l_1 and
said second list has a length l_2, and k is a fixed percentage of whichever of l_1 and l_2 is
smaller.

18. The method of any one of claims 11 to 16, wherein said first list has a length l_1 and
said second list has a length l_2, and k is a fixed percentage of a sum of l_1 and l_2.

19. The method of any one of claims 11 to 16, wherein said first list has a length l_1, and
said second list has a length l_2, and k is a fixed percentage of any one of l_1 or l_2.

20. The method of any one of claims 17 to 19, wherein said fixed percentage is from about
1% to about 5% and k is rounded to a nearest integer value.

21. The method of any one of the preceding claims, wherein k is from about 5 to about 50.

22. The method of claim 21, wherein k is about 20.

23. The method of any one of the preceding claims, wherein each emerging pattern is
expressed as a conjunction of conditions.
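
By way of non-limiting illustration, a pattern expressed as a conjunction of conditions may be tested against a sample as sketched below. The function name `occurs_in`, the feature names, and the representation of each condition as a half-open numeric interval are all assumptions made for the example.

```python
def occurs_in(pattern, sample):
    """Return True if every condition of the pattern holds for the sample.

    pattern: dict mapping feature name -> (low, high); each entry is the
        condition low <= value < high (assumed form of a condition).
    sample: dict mapping feature name -> measured value.
    """
    # A conjunction holds only when all of its conditions hold; a
    # feature missing from the sample fails its condition.
    return all(
        name in sample and low <= sample[name] < high
        for name, (low, high) in pattern.items()
    )
```

For example, the pattern `{"gene_a": (0.0, 50.0), "gene_b": (120.0, float("inf"))}` occurs in a sample whose `gene_a` value is 12.3 and whose `gene_b` value is 300.0.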
24. The method of any one of the preceding claims, wherein only left boundary emerging
patterns are used.

25. The method of any one of claims 1 to 23, wherein only plateau emerging patterns are
used.

26. The method of claim 25 wherein only the most specific plateau emerging patterns are 
used. 


27. The method of any one of the preceding claims, wherein each of said emerging patterns 
has a growth rate larger than a threshold, ρ.

28. The method of claim 27 wherein said threshold is from about 2 to about 10. 


29. The method of any one of the preceding claims, wherein each of said emerging patterns
has a growth rate of ∞.
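
By way of non-limiting illustration, the growth rate conditions may be sketched as follows. The growth rate is assumed here to be the ratio of a pattern's frequency in its home class to its frequency in the other class, which is the usual definition in the emerging-pattern literature but is not spelled out in the claims. The function names are hypothetical.

```python
import math


def growth_rate(freq_home, freq_other):
    """Growth rate of a pattern from the other class to its home class.

    freq_home: frequency of the pattern in its home class.
    freq_other: frequency of the pattern in the other class.
    """
    if freq_other == 0:
        # A pattern absent from the other class has infinite growth
        # rate; a pattern absent from both is given 0 by convention.
        return math.inf if freq_home > 0 else 0.0
    return freq_home / freq_other


def exceeds_threshold(freq_home, freq_other, threshold):
    """True if the growth rate is strictly larger than the threshold."""
    return growth_rate(freq_home, freq_other) > threshold
```

A pattern with frequency 0.3 in its home class and 0 in the other class therefore has an infinite growth rate and exceeds any finite threshold.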

30. The method of any one of the preceding claims, additionally comprising discretizing
said data set, before said extracting.

31. The method of claim 30, wherein said discretizing utilizes an entropy-based method.

32. The method of claim 30 or 31, additionally comprising applying a method of
correlation-based feature selection to said data set, after said discretizing.

33. The method of claim 30, 31 or 32, additionally comprising applying a chi-squared
method to said data set, after said discretizing.

34. The method of any one of the preceding claims, wherein said data set comprises gene
expression data.

35. The method of claim 34, wherein said gene expression data has been acquired from a
micro-array apparatus.



36. The method of any one of the preceding claims, wherein at least one class of data
corresponds to data for a first type of cell and at least another class of data corresponds to
data for a second type of cell.

37. The method of claim 36, wherein said first type of cell is a normal cell and said second 
type of cell is a cancerous cell. 


38. The method of any one of the preceding claims, wherein at least one class of data 
corresponds to data for a first population of subjects and at least another class of data 
corresponds to data for a second population of subjects. 

39. The method of any one of claims 1 to 33, wherein said data set comprises patient
medical records. 

40. The method of any one of claims 1 to 33, wherein said data set comprises financial 
transactions. 


41. The method of any one of claims 1 to 33, wherein said data set comprises census data. 

42. The method of any one of claims 1 to 33, wherein said data set comprises
characteristics of an item selected from the group consisting of: a foodstuff; an article of
manufacture; and a raw material.

43. The method of any one of claims 1 to 33, wherein said data set comprises
environmental data.

44. The method of any one of claims 1 to 33, wherein said data set comprises
meteorological data.

45. The method of any one of claims 1 to 33, wherein said data set comprises
characteristics of a population of organisms.

46. The method of any one of claims 1 to 33, wherein said data set comprises marketing
data.

47. A computer program product for determining whether a test sample, for which there
exists test data, is categorized in a first class or a second class, wherein the computer
program product is for use in conjunction with a computer system, the computer program
product comprising:
a computer readable storage medium and a computer program mechanism embedded
therein, the computer program mechanism comprising:
at least one statistical analysis tool;
at least one sorting tool; and
control instructions for:
accessing a data set that has at least one instance of a first class of data
and at least one instance of a second class of data;
extracting a plurality of emerging patterns from said data set;
creating a first list and a second list wherein, for each of said plurality of
emerging patterns:
said first list contains a frequency of occurrence, f_i^(1), of each
emerging pattern i from said plurality of emerging patterns that has a
non-zero occurrence in said first class of data, and
said second list contains a frequency of occurrence, f_i^(2), of each
emerging pattern i from said plurality of emerging patterns that has a
non-zero occurrence in said second class of data;
using a fixed number, k, of emerging patterns, wherein k is substantially
less than a total number of emerging patterns in the plurality of emerging
patterns, calculating:
a first score derived from the frequencies of k emerging patterns
in said first list that also occur in said test data, and
a second score derived from the frequencies of k emerging
patterns in said second list that also occur in said test data; and
deducing whether the test sample is categorized in the first class of data
or in the second class of data by selecting the higher of the first score and the
second score.

48. The computer program product of claim 47, additionally comprising instructions for:
if said first score and said second score are equal, deducing whether the test sample is 
categorized in the first class of data or in the second class of data by selecting the larger of 
the first or the second class of data. 

49. The computer program product of claim 47 or 48, wherein:

said k emerging patterns of said first list that occur in said test data have the highest
frequencies of occurrence in said first list amongst all those emerging patterns of said first
list that occur in said test data; and

said k emerging patterns of said second list that occur in said test data have the highest
frequencies of occurrence in said second list amongst all those emerging patterns of said
second list that occur in said test data.

50. The computer program product of any one of claims 47 to 49, further comprising 
control instructions for: 

ordering emerging patterns in said first list in descending order of said frequency of
occurrence in said first class of data, and

ordering emerging patterns in said second list in descending order of said frequency of 
occurrence in said second class of data. 

51. The computer program product of any one of claims 47 to 50, additionally comprising
instructions for:

creating a third list and a fourth list, wherein:
said third list contains a frequency of occurrence, f_1(i_m), in said first class of
data of each emerging pattern i_m from said plurality of emerging patterns that has a
non-zero occurrence in said first class of data and which also occurs in said test
data; and
said fourth list contains a frequency of occurrence, f_2(j_m), in said second class
of data of each emerging pattern j_m from said plurality of emerging patterns that has
a non-zero occurrence in said second class of data and which also occurs in said test
data; and wherein
emerging patterns in said third list are ordered in descending order of
said frequency of occurrence in said first class of data, and
emerging patterns in said fourth list are ordered in descending order of
said frequency of occurrence in said second class of data.

52. The computer program product of claim 51, further comprising instructions for
calculating:

said first score according to the formula: $\sum_{m=1}^{k} \frac{f_1(i_m)}{f_1(m)}$, where $EP_1(i_m) \in T$; and

said second score according to the formula: $\sum_{m=1}^{k} \frac{f_2(j_m)}{f_2(m)}$, where $EP_2(j_m) \in T$.



53. The computer program product of any one of claims 47 to 52, wherein k is from
about 5 to about 50.

54. The computer program product of any one of claims 47 to 53, wherein only left 
boundary emerging patterns are used. 

55. The computer program product of any one of claims 47 to 54, wherein each of
said emerging patterns has a growth rate of ∞.

56. The computer program product of any one of claims 47 to 55, wherein said data
set comprises data selected from the group consisting of: gene expression data, patient
medical records, financial transactions, census data, characteristics of an article of
manufacture, characteristics of a foodstuff, characteristics of a raw material, meteorological
data, environmental data, and characteristics of a population of organisms.

57. A system for determining whether a test sample, for which there exists test data,
is categorized in a first class or a second class, the system comprising:
at least one memory, at least one processor and at least one user interface, all of
which are connected to one another by at least one bus;
wherein said at least one processor is configured to:
access a data set that has at least one instance of a first class of data and at least
one instance of a second class of data;
extract a plurality of emerging patterns from said data set;
create a first list and a second list wherein, for each of said plurality of emerging
patterns:
said first list contains a frequency of occurrence, f_i^(1), of each emerging
pattern i from said plurality of emerging patterns that has a non-zero
occurrence in said first class of data, and
said second list contains a frequency of occurrence, f_i^(2), of each
emerging pattern i from said plurality of emerging patterns that has a
non-zero occurrence in said second class of data;
use a fixed number, k, of emerging patterns, wherein k is substantially less than
a total number of emerging patterns in the plurality of emerging patterns, to
calculate:
a first score derived from the frequencies of k emerging patterns in said
first list that also occur in said test data, and
a second score derived from the frequencies of k emerging patterns in
said second list that also occur in said test data; and
deduce whether the test sample is categorized in the first class of data or in the
second class of data by selecting the higher of the first score and the second score.

58. The system of claim 57, wherein said processor is additionally configured to:

if said first score and said second score are equal, deduce whether the test sample is
categorized in the first class of data or in the second class of data by selecting the larger of
the first or the second class of data.

59. The system of claim 57 or 58, wherein:

said k emerging patterns of said first list that occur in said test data have the highest
frequencies of occurrence in said first list amongst all those emerging patterns of said first
list that occur in said test data; and

said k emerging patterns of said second list that occur in said test data have the highest
frequencies of occurrence in said second list amongst all those emerging patterns of said
second list that occur in said test data.

60. The system of claim 57, 58 or 59, wherein said processor is additionally 
configured to: 

order emerging patterns in said first list in descending order of said frequency of
occurrence in said first class of data, and
order emerging patterns in said second list in descending order of said frequency of
occurrence in said second class of data.

61. The system of any one of claims 57 to 60, wherein said processor is additionally
configured to:

create a third list and a fourth list, wherein:
said third list contains a frequency of occurrence, f_1(i_m), in said first class of
data of each emerging pattern i_m from said plurality of emerging patterns that has a
non-zero occurrence in said first class of data and which also occurs in said test
data; and
said fourth list contains a frequency of occurrence, f_2(j_m), in said second class
of data of each emerging pattern j_m from said plurality of emerging patterns that has
a non-zero occurrence in said second class of data and which also occurs in said test
data; and wherein
emerging patterns in said third list are ordered in descending order of
said frequency of occurrence in said first class of data, and
emerging patterns in said fourth list are ordered in descending order of
said frequency of occurrence in said second class of data.

62. The system of claim 61, wherein said processor is additionally configured to
calculate:

said first score according to the formula: $\sum_{m=1}^{k} \frac{f_1(i_m)}{f_1(m)}$, where $EP_1(i_m) \in T$; and

said second score according to the formula: $\sum_{m=1}^{k} \frac{f_2(j_m)}{f_2(m)}$, where $EP_2(j_m) \in T$.



63. The system of any one of claims 57 to 62, wherein k is from about 5 to about 50.



64. The system of any one of claims 57 to 63, wherein only left boundary emerging 
patterns are used. 

65. The system of any one of claims 57 to 64, wherein each of said emerging
patterns has a growth rate of ∞.

66. The system of any one of claims 57 to 65, wherein said data set comprises data
selected from the group consisting of: gene expression data, patient medical records,
financial transactions, census data, characteristics of an article of manufacture,
characteristics of a foodstuff, characteristics of a raw material, meteorological data,
environmental data, and characteristics of a population of organisms.

67. A method of determining whether a sample cell is cancerous, comprising:
extracting a plurality of emerging patterns from a data set that comprises gene
expression data for a plurality of cancerous cells and gene expression data for a plurality
of normal cells;
creating a first list and a second list wherein:
said first list contains a frequency of occurrence, f_i^(1), of each emerging pattern
i from said plurality of emerging patterns that has a non-zero occurrence in said
cancerous cells, and
said second list contains a frequency of occurrence, f_i^(2), of each emerging
pattern i from said plurality of emerging patterns that has a non-zero occurrence in
said normal cells;
using a fixed number, k, of emerging patterns, wherein k is substantially less
than a total number of emerging patterns in the plurality of emerging patterns,
calculating:
a first score derived from the frequencies of k emerging patterns in said
first list that also occur in said test data, and
a second score derived from the frequencies of k emerging patterns in
said second list that also occur in said test data; and
deducing whether the sample cell is cancerous if said first score is higher than
said second score.

68. A method of determining whether a test sample, having test data T, is categorized in
one of a number of classes, substantially as hereinbefore described with reference to and as
illustrated in the accompanying drawings.

69. The computer program product of any one of claims 47 to 56, operable according to the 
method of any one of claims 1 to 46, 67 and 68. 

70. A computer program product operable according to the method of any one of claims 1
to 46, 67 and 68. 

71. A computer program product for determining whether a test sample, for which there
exists test data, is categorized in one of a number of classes, constructed and arranged to
operate substantially as hereinbefore described with reference to and as illustrated in the
accompanying drawings.

72. The system of any one of claims 57 to 66, operable according to the method of any one 
of claims 1 to 46, 67 and 68. 


73. A system for determining whether a test sample, for which there exists test data, is 
categorized in one of a number of classes, constructed and arranged to operate substantially 
as hereinbefore described with reference to and as illustrated in the accompanying 
drawings. 


74. A system operable according to the method of any one of claims 1 to 46, 67 and 68. 

75. The system of any one of claims 57 to 66 and 71 to 73, for use with the computer 
program product of any one of claims 47 to 56 and 69 to 71. 